-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35730: [C++] Add the ability to specify custom schema on a dataset write #35860
GH-35730: [C++] Add the ability to specify custom schema on a dataset write #35860
Conversation
|
Starting in draft stage as I believe we should get some python unit tests in here and @anjakefala volunteered to do some. |
So far, I added 2 basic tests, based on my understanding of the feature! The basic case where you write a single table, which contains a field with nullability specified, passes. Note that this one:
is failing for now. I did not specify nullability in the table's schema, but then specified it in Is it expected that the returned dataset would have a field with nullability? In this example
The resulting schema also does not have nullability. |
I have confirmed that
has a field with nullability. The following do not:
or
|
…We can, however, create an InMemoryDataset with mixed field metadata and so I created a test for that
These lines failed for me with the following error:
I thought this was supported and it took me a moment to track down what was going on. The error is actually being raised before the C++ call to write the dataset. Pyarrow is taking the two inputs ( If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it (at least for this PR. Arguably, you could evolve a table into the correct schema if adding it to an InMemoryDataset but that's a different feature). This is kind of confusing because @anjakefala and I were testing earlier and you are allowed to create an |
Oh, and thank you for adding the test case :) |
Good to know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
I just have one question about whether we should also validate the names of the schema; the rest are minor code style nitpicks in the test code, for which I can also push some changes.
} | ||
for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) { | ||
if (!input_schema->field(field_idx)->type()->Equals( | ||
custom_schema->field(field_idx)->type())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test that the names of the fields are equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the names should be safe. Admittedly, a user could also do this name change by inserting a project node before the write node.
I could be convinced otherwise but I don't think this does any harm and I think, as a user, I would expect this behavior, so it wouldn't be surprising that the names changed.
cpp/src/arrow/dataset/file_base.cc
Outdated
|
||
if (custom_schema != nullptr) { | ||
if (custom_schema->num_fields() != input_schema->num_fields()) { | ||
return Status::Invalid( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status::TypeError
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched. Although I will confess I'm not entirely clear on when to use one over the other. In my mind TypeError
is only for cases where a dynamic_cast
fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to be honest for having a wrong number of fields, I would also raise a ValueError/Invalid (it's a wrong value, not a wrong type)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A schema error is a type error IMHO.
cpp/src/arrow/dataset/file_base.cc
Outdated
for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) { | ||
if (!input_schema->field(field_idx)->type()->Equals( | ||
custom_schema->field(field_idx)->type())) { | ||
return Status::Invalid("The provided custom_schema specified type ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched.
|
||
/// \brief Options to control how to write the dataset | ||
FileSystemDatasetWriteOptions write_options; | ||
/// \brief Optional metadata to attach to written batches | ||
std::shared_ptr<const KeyValueMetadata> custom_metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of removing this option (as a breaking change), we could in theory still allow the user to specify one of both?
(I am not using the C++ API for this, so I don't know how useful this would be / how cumbersome it is to specify the schema if you only want to specify metadata. From the DatasetWriter point of view, this is a fine change of course since there we already have the full schema)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it were a new feature I would argue it's not worth it (A user could technically use DeclarationToSchema
to get the output schema of the plan leading up to the write and then attach custom metadata to that). However, given we have already released custom_metadata
, and I would like Acero's API to start being stable, I suppose I should set an example. Thanks for the nudge. I have restored custom_metadata
The failing test on AMD64 is unrelated (flight ucx test). There is still one R test running but I think this is good to go now. |
cpp/src/arrow/dataset/file_base.h
Outdated
@@ -34,6 +34,7 @@ | |||
#include "arrow/filesystem/filesystem.h" | |||
#include "arrow/io/file.h" | |||
#include "arrow/util/compression.h" | |||
#include "arrow/util/key_value_metadata.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary if arrow/type_fwd.h
is included.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this include and explicitly included arrow/type_fwd.h
.
@jorisvandenbossche Do you want to take a look at the Python test changes? |
Co-authored-by: Antoine Pitrou <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the python test changes!
Weston took out the invalid test case, and added one that covers how the field metadata gets applied.
There were no changes to this since my previous review (and push), so all good! |
I'm going to go ahead and merge if green. I think it's quite late for Joris and I believe Raul will be starting the patch build soon. If Joris finds an issue later then we can still scrap the patch release candidate for a fix (though hopefully we don't need to). |
… write (#35860) ### Rationale for this change The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type. ### What changes are included in this PR? Added `custom_schema` to `WriteNodeOptions` and removed `custom_metadata`. ### Are these changes tested? Yes, I added a new C++ unit test to verify that the custom info is applied to written files. ### Are there any user-facing changes? No. Only new functionality (which is user facing) * Closes: #35730 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Nic Crane <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: anjakefala <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Weston Pace <[email protected]>
@github-actions crossbow submit test-r-ubuntu-22.04 |
Revision: 875dfed Submitted crossbow builds: ursacomputing/crossbow @ actions-42981f190e
|
Benchmark runs are scheduled for baseline = 3fe4a31 and contender = 018e7d3. 018e7d3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…ataset write (apache#35860) The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type. Added `custom_schema` to `WriteNodeOptions` and removed `custom_metadata`. Yes, I added a new C++ unit test to verify that the custom info is applied to written files. No. Only new functionality (which is user facing) * Closes: apache#35730 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Nic Crane <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: anjakefala <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Weston Pace <[email protected]>
Rationale for this change
The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type.
What changes are included in this PR?
Added
custom_schema
toWriteNodeOptions
and removedcustom_metadata
.Are these changes tested?
Yes, I added a new C++ unit test to verify that the custom info is applied to written files.
Are there any user-facing changes?
No. Only new functionality (which is user facing)